1 Introduction

This is the in-the-weeds approach. In this document I will dig into the problem and create several functions that I can save to use later in a more streamlined document.

2 High level summary

What do we have in the training set?

##          fname        label manually_verified
## 1 00044347.wav       Hi-hat                 0
## 2 001ca53d.wav    Saxophone                 1
## 3 002d256b.wav      Trumpet                 0
## 4 0033e230.wav Glockenspiel                 1
## 5 00353774.wav        Cello                 1
## 6 003b91e8.wav        Cello                 0

The description of the data says that “a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology … Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category”. It adds that “The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category.”

With that in mind, I would prefer to use only the manually verified data.
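A minimal sketch of that filter, assuming the labels sit in a data frame with the three columns shown above. The name `train` and the toy rows are placeholders for illustration, not the real file.

```r
# Toy stand-in for the training labels; `train` is a hypothetical name.
train <- data.frame(
  fname = c("00044347.wav", "001ca53d.wav", "002d256b.wav", "0033e230.wav"),
  label = c("Hi-hat", "Saxophone", "Trumpet", "Glockenspiel"),
  manually_verified = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Keep only the rows that a person checked.
verified <- train[train$manually_verified == 1, ]
```

Subsetting this way keeps the original row names, which matches the gaps in the row numbers of the filtered table.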

##           fname              label manually_verified
## 2  001ca53d.wav          Saxophone                 1
## 4  0033e230.wav       Glockenspiel                 1
## 5  00353774.wav              Cello                 1
## 7  003da8e5.wav              Knock                 1
## 8  0048fd00.wav Gunshot_or_gunfire                 1
## 11 006f2f32.wav             Hi-hat                 1

With that reduction, how many do we have in each category?

There are 67 sound clips labeled “Bark”.
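Counting clips per category can be done with `table()`. This is only a sketch on made-up labels; `verifiedLabels` is a hypothetical vector, and the real data has many more rows and categories.

```r
# Hypothetical labels of verified clips, for illustration only.
verifiedLabels <- c("Bark", "Cello", "Bark", "Knock", "Cello", "Bark")

# Count clips per label, largest category first.
labelCounts <- sort(table(verifiedLabels), decreasing = TRUE)

# Pull out the count for a single category by name.
nBark <- labelCounts[["Bark"]]
```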

Of course, all of these files are in the inputs folder.

Grab all of the sounds of dogs barking.

3 Example dog bark

## Warning: package 'audio' was built under R version 3.4.4
## Class 'audioSample'  atomic [1:620928] -3.05e-05 3.05e-05 0.00 -3.05e-05 -3.05e-05 ...
##   ..- attr(*, "rate")= int 44100
##   ..- attr(*, "bits")= int 16

The “rate” of 44,100 means that the pressure in front of the microphone was measured 44,100 times per second. The clip has 620,928 values, so it is about 14 seconds long (620,928 / 44,100 = 14.08).

This means we can assign each value a time in seconds.
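A small sketch of the index-to-time conversion. `indexToSeconds` is a hypothetical helper written for this example, and it assumes the first sample is taken at time 0; starting the clock at 1/rate instead would be an equally reasonable convention.

```r
# Hypothetical helper: time in seconds for each of n samples at a given rate.
indexToSeconds <- function(n, rate = 44100) {
  # Assumption: sample i is taken at (i - 1) / rate, so the first sample is at 0.
  (seq_len(n) - 1) / rate
}

# One time stamp per value in a clip with 620,928 samples.
times <- indexToSeconds(620928)
```

Going the other way, an index can be recovered from a time with `round(seconds * rate) + 1` under the same assumption.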

We’ll have to flip between using the index and using time in seconds as the x-axis.

Zoom in to the range marked by the two red lines.

4 Break up sound file

A general bump in a sound wave view is referred to as an amplitude envelope. We need a way to automatically pick out these units of sound within a sound file.

## [1] 264

From how it’s described, that function should be just what I need, but it found an awful lot of peaks, and I don’t see an easy way to merge adjacent peaks.

So I considered another method that wasn’t necessarily made for sound, but it looks like it will work.

Pick out amplitude envelopes using the findpeaks function.

Take a look at the results from findpeaks.

I was able to get reasonable peaks with findpeaks, but I didn’t like the start and end that were assigned to each peak. I want to use the peaks from findpeaks, but define the start and stop of each peak based on when the sound returns to some threshold.

To do this, I want to look for continuous areas of low values.

A value is only true if the surrounding n values are also true.

##  [1]    NA FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE  TRUE  TRUE    NA
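One way such a filter could be written, as a sketch only. `noLonelyTrueSketch` is a made-up name and is not the function used in this document. It assumes that “surrounding n values” means n values on each side, and that positions too close to an end to have a full window are returned as NA (or as FALSE when `naEdges = FALSE`).

```r
# Hypothetical stand-in for the filter described above; not the original code.
# Assumption: a value stays TRUE only if it and the n values on each side are TRUE.
noLonelyTrueSketch <- function(x, n = 1, naEdges = TRUE) {
  win <- 2 * n + 1
  # Centered moving sum of the 0/1 values; positions without a full window are NA.
  s <- stats::filter(as.numeric(x), rep(1, win), sides = 2)
  out <- as.numeric(s) == win
  # Optionally treat the undecidable positions at the ends as FALSE.
  if (!naEdges) out[is.na(out)] <- FALSE
  out
}

flags <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
kept <- noLonelyTrueSketch(flags, n = 1)
keptNoNA <- noLonelyTrueSketch(flags, n = 1, naEdges = FALSE)
```

The moving sum avoids an explicit loop, which matters when the input has hundreds of thousands of samples.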

Now define the start and stop of each peak based on where there is a continuous run of low values.

getPeaksEdges <- function(vals, peaks, threshold, minNumBelow=20){
    # vals - numeric vector, the sound values.
    # peaks - the indices of the peaks.
    # threshold - scalar; values below this are considered quiet, i.e. a pause.
    # minNumBelow - integer, number of consecutive values below threshold to be called a pause.
    ######
    
    # Make a matrix where each position in the sound is a row and each peak is a column.
    # Make a boolean matrix describing whether each position is below threshold.
    # The signal may dip below the threshold briefly; only consider it a pause if it stays below threshold.
    nrow=length(vals)
    ncol=length(peaks)
    bt1 = vals < threshold
    bt2 = noLonelyTrue(bt1, n=minNumBelow, naEdges = FALSE)
    bt = matrix(data=bt2, byrow = FALSE, nrow=nrow, ncol=ncol)
    
    # Make a matrix with the same layout describing whether the value (row) is before a given peak (column).
    beforePeak = rep(1:length(vals), length(peaks)) < rep(peaks, each=length(vals))
    bp = matrix(data=beforePeak, byrow=FALSE, nrow=nrow, ncol=ncol)
    
    # Combine these to get values that are pause areas before the peak.
    # For each column (peak) keep the true value with the highest possible index; that is the peak start.
    isBefore = bt & bp
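    # which() is empty when no pause precedes the peak; return NA rather than -Inf.
    startInd = apply(isBefore, 2, function(x){w = which(x); if (length(w) > 0) max(w) else NA_integer_})
    
    # Use the mirror image of the process above to get the end for each peak.
    isAfter = bt & !bp
    endInd = apply(isAfter, 2, function(x){w = which(x); if (length(w) > 0) min(w) else NA_integer_})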

    return(data.frame(peak=peaks, startInd=startInd, endInd=endInd))
}
toolkitList = c(toolkitList, "getPeaksEdges")
peakEnds = getPeaksEdges(numB1, peaks=allEnv$peak, threshold=threshold)

plot(timesB1, numB1, 
         type="l", col="darkblue", las=1, xlim=xlim, xlab=xlabTime, ylab=ylab)
points(allEnv$peakTime, allEnv[,"height"], col="red", pch=16)
abline(h=threshold, col="pink")
abline(v=getTimes(numB1)[peakEnds$startInd], col="green")
abline(v=getTimes(numB1)[peakEnds$endInd], col="red")

With this method, there are some peaks that are close enough to each other that they have the same start and end positions. I don’t want to copy the same slice and mistake it for a repeating pattern, so I’m going to remove those duplicates.
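Removing those repeats can be sketched with `duplicated()` on the two edge columns. The toy table below only mimics the layout returned by getPeaksEdges; the numbers are invented.

```r
# Toy peak table in the same layout as the getPeaksEdges output (invented values).
peakEnds <- data.frame(
  peak     = c(120, 150, 400, 900),
  startInd = c(100, 100, 350, 800),
  endInd   = c(200, 200, 450, 990)
)

# Flag rows whose (startInd, endInd) pair already appeared in an earlier row.
isRepeat <- duplicated(peakEnds[, c("startInd", "endInd")])

# Keep the first peak of each shared envelope.
uniqueEnv <- peakEnds[!isRepeat, ]
```

This keeps whichever peak comes first in the table; keeping the tallest peak of each envelope instead would need a sort by height beforehand.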

This gives us 20 amplitude envelopes in this file, and most of them look reasonable.

5 Fourier transform

5.1 generic

5.1.4 Optimize smoothing parameter - span

I know I can count on having a lot of little hills that don’t matter, so I’m going to take the 75th percentile and use that as a benchmark for “surely small”. Any maxima that are no higher than that can be considered noise.
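A sketch of that cutoff on a toy curve. It assumes the percentile is taken over the heights of the local maxima, which the text does not spell out; the names `vals`, `cutoff` and `mainPeaks` are made up for the example.

```r
# Toy curve: two tall peaks among several small hills (invented values).
vals <- c(1, 3, 1, 2, 1, 9, 1, 2, 1, 3, 1, 8, 1)

# A local maximum is an interior point that rises before it and falls after it.
d <- diff(vals)
isMax <- c(FALSE, d[-length(d)] > 0 & d[-1] < 0, FALSE)
maxIdx <- which(isMax)
heights <- vals[maxIdx]

# Assumption: the "surely small" benchmark is the 75th percentile of the maxima heights.
cutoff <- quantile(heights, 0.75)

# Maxima no higher than the cutoff are treated as noise.
mainPeaks <- maxIdx[heights > cutoff]
```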

Smooth out the values to remove unimportant peaks.

Using a span of 0.07 is just enough to only get the main peaks. We still get the right number of peaks even with a span of more than twice that. As we increase the span, we see the values we would get from our maxima shift away from the accurate values (the red dots move away from the red, green and gold lines).
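The effect of the span can be sketched by counting local maxima after smoothing at a few settings. This assumes a loess-style smoother, where `span` is the fraction of points used in each local fit; the text does not name the smoother, so treat `loess()` here as a stand-in. The toy signal and `countMaxima` are invented for the example.

```r
# Toy signal: a slow wave with a fast, small ripple on top (invented).
x <- seq(0, 10, length.out = 500)
y <- sin(x) + 0.2 * sin(25 * x)

# Count interior points that are higher than both neighbors.
countMaxima <- function(v) {
  d <- diff(v)
  sum(d[-length(d)] > 0 & d[-1] < 0)
}

# Assumption: loess() stands in for the smoother; try a few span values.
spans <- c(0.05, 0.2, 0.5)
nPeaks <- sapply(spans, function(s) {
  fit <- loess(y ~ x, span = s)
  countMaxima(fitted(fit))
})

# Raw count for comparison; the ripple adds many small maxima.
nRaw <- countMaxima(y)
```

A wider span averages over more points, so the ripple maxima disappear first and, with a wide enough span, the main peaks start to shift as well.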

Now we need a rule a computer can follow that will help it come to similar conclusions about optimizing the span for a given clip. There is probably some really good algorithm for doing that, but for now this will do:

Based on this logic, 0.068 is the optimum span parameter for this clip.

And based on that value for span, the frequencies of interest are:

## [1] 0.798 1.500 1.997

5.1.7 intensity over time line plot

For just the frequencies of interest, plot their intensity over time (across the sliding windows).

The only thing we did to optimize the window size and shift size was… well nothing. I looked at the picture and adjusted it. I don’t know any good rules there, except hope that the same parameters work well over most cases.

5.1.9 Plot recovered waves

Not perfect… but not too bad. I think real sound data will have more cycles per unit, and I think that will mean an increase in resolution. It would also look cleaner to have used a cutoff rather than scaling the amplitude by the intensity, but I don’t think that is as true for real data as it is for this simulation.

Here’s the original for comparison:

5.2 dog bark

The original wave (g) is represented by our input data. For t, we need to create a vector of time measurements corresponding to the values in our sound clip.

Based on Wikipedia’s Audio Frequency page, sound frequencies that we can hear will be covered by a range of about 20 to 20,000 Hz.

Do the Fourier Transform for the arf sample.
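As a rough illustration of the idea, here is a plain `fft()` on a synthetic one-second tone, restricted to the audible range. This is not the sliding-window code used for the bark clip; the signal `g`, the two tone frequencies and the variable names are all invented for the example.

```r
# Synthetic one-second clip at the same sample rate as the recordings.
rate <- 44100
t <- (0:(rate - 1)) / rate
g <- sin(2 * pi * 500 * t) + 0.5 * sin(2 * pi * 1780 * t)

# Magnitude spectrum; bin k corresponds to frequency (k - 1) * rate / length(g).
spec <- Mod(fft(g)) / length(g)
freq <- (seq_along(g) - 1) * rate / length(g)

# Keep the audible range, which also drops the mirrored upper half of the spectrum.
keep <- freq >= 20 & freq <= 20000

# The two strongest frequencies, strongest first.
topFreq <- freq[keep][order(spec[keep], decreasing = TRUE)[1:2]]
```

Because the clip is exactly one second long, each bin is 1 Hz wide and both tones land exactly on a bin; a real recording will spread its energy across neighboring bins.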

Which frequencies are of interest?

## [1] 1780   60 1020  500 1420

6 Other stuff

Smoothing by maxes: this is something we might explore as a way to summarize an overall shape.

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] audio_0.1-5.1 seewave_2.1.0 pracma_2.1.4 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17    digest_0.6.15   rprojroot_1.3-2 MASS_7.3-50    
##  [5] backports_1.1.2 signal_0.7-6    magrittr_1.5    evaluate_0.10.1
##  [9] stringi_1.2.3   rmarkdown_1.10  tools_3.4.1     tuneR_1.3.3    
## [13] stringr_1.3.1   yaml_2.1.19     compiler_3.4.1  htmltools_0.3.6
## [17] knitr_1.20